Abstract

In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic
covering (SDC), arising in many modern-day web applications, including ad-serving and online
recommendation systems in eBay and Netflix. Roughly speaking, SDC applies two restrictions
to the well-studied Max-Coverage problem [S]: Given an integer $k$, $X = \{1, 2, \ldots, n\}$ and
$\mathcal{I} = \{S_1, \ldots, S_m\}$, $S_i \subseteq X$, find $\mathcal{J} \subseteq \mathcal{I}$ such that $|\mathcal{J}| \le k$ and
$|\bigcup_{S \in \mathcal{J}} S|$ is as large as possible. The two restrictions applied by SDC are: (1) Dynamic: At
query-time, we are given a query $Q \subseteq X$, and our goal is to find $\mathcal{J}$ such that
$|Q \cap (\bigcup_{S \in \mathcal{J}} S)|$ is as large as possible; (2) Space-constrained: We don't have enough space
to store (and process) the entire input; specifically, we have $o(mn)$, and maybe as little as
$O((m + n)\,\mathrm{polylog}(mn))$ space. A solution to SDC maintains a small data structure, and uses this
datastructure to answer most dynamic queries with high accuracy. We call such a scheme a
Coverage Oracle.

We present algorithms and complexity results for coverage oracles. We present deterministic 
and probabilistic near-tight upper and lower bounds on the approximation ratio of SDC as a 
function of the amount of space available to the oracle. Our lower bound results show that to
obtain constant-factor approximations we need $\Omega(mn)$ space. Fortunately, our upper bounds
present an explicit tradeoff between space and approximation ratio, allowing us to determine 
the amount of space needed to guarantee certain accuracy. 



1 Introduction 



The explosion of data and applications on the web over the last decade has given rise to many
new data management challenges. This paper identifies a fundamental subproblem inherent in 
several Web applications, including online recommendation systems, and serving advertisements 
on webpages. Let us begin with a motivating example. 

Example 1.1. Consider the online movie rental and streaming website, Netflix$^{1}$, and one of their
users Alice. Based on Alice's movie viewing (and rating) history, Netflix would like to recommend
new movies to Alice for watching. (Indeed, Netflix threw open a million-dollar challenge on
significantly improving their movie recommendations.$^{2}$) Conceivably, there are many ways of devising
algorithms for recommendation ranging from data mining to machine learning techniques, and
indeed there has been a great deal of such work on providing personalized recommendations (see [1]
for a survey). Regardless of the specific technique, an important subproblem that arises is finding
users "similar" to Alice, i.e., finding users who have independently or in conjunction viewed (and
liked) movies seen (and liked) by Alice.

Abstractly speaking, we are given a universal set of all Netflix movies, and Netflix users identified 
by the subset of movies they have viewed (and liked or disliked). Given a specific user Alice, we 
are interested in finding (say k) other users, who together cover a large set of Alice's likes and 
dislikes. Note that for each user, the set of movies that need to be covered is different, and therefore 
the covering cannot be performed statically, independent of the user. In fact, Netflix dynamically 
provides movie recommendations as users rate movies in a particular genre (say comedy), or request 
movies in specific languages, or time periods. Providing recommendations at interactive speed, based
on user queries (such as for a particular genre), rules out computationally-expensive processing over
the entire Netflix data, which is very large.$^{3}$ Therefore, we are interested in approximately solving
the aforementioned covering problem based on a subset of the data. 

The main challenge that arises is to statically identify a subset of the data that would provide 
good approximations to the covering problem for any dynamic user query. 

Note that a very similar challenge arises in other recommendation systems, such as when Alice visits
an online shopping website like eBay$^{4}$ or Amazon$^{5}$, and the website is interested in recommending
products to Alice based on her current query for a particular brand or product, and her prior 
purchasing (and viewing) history. 

1 www.netflix.com
2 http://www.netflixprize.com/
3 Netflix currently has over 10 million users, over 100,000 movies, and obviously some of the popular movies have
been viewed by many users, and movie buffs have rated a large number of movies; Netflix owns over 55 million discs.
4 www.ebay.com
5 www.amazon.com

The example above can be formulated as an instance of a simple algorithmic covering problem,
generalizing the NP-hard optimization problem max k-cover [9]. The input to this problem is an
integer $k$, a set $X = \{1, \ldots, n\}$, a family $\mathcal{I} \subseteq 2^X$ of subsets of $X$, and a query $Q \subseteq X$. Here
$(X, \mathcal{I})$ is called a set system, $X$ is called the ground set of the set system, and members of $X$
are called elements or items. We make no assumptions on how the set system is represented in the
input, though the reader can think of the obvious representation by an $n \times m$ bipartite graph for
intuition. This $n \times m$ bipartite graph can be stored in $O(nm)$ bits, which is in fact
information-theoretically optimal for storing an arbitrary set system on $n$ items and $m$ sets. The
objective of the problem is to return $\mathcal{J} \subseteq \mathcal{I}$ with $|\mathcal{J}| \le k$ that collectively cover as much of
$Q$ as possible. Since this problem is a generalization of max k-cover, it is NP-hard. Nevertheless,
absent any additional constraints this problem can be approximated in polynomial time by a
straightforward adaptation of the greedy algorithm for max k-cover [11], which attains a constant
factor approximation in $O(mn)$ time.$^{6}$ However, we further constrain solutions to the problem as
follows, rendering new techniques necessary.
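The greedy adaptation mentioned above can be sketched as follows. This is a minimal illustration in Python, not code from the paper: the function names `greedy_cover` and `coverage`, and the representation of the set system as a dictionary from set names to Python sets, are assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable, List, Set


def greedy_cover(sets: Dict[str, Set[Hashable]], query: Iterable[Hashable], k: int) -> List[str]:
    """Pick up to k set names greedily, each time taking the set that
    covers the most still-uncovered items of the query."""
    uncovered = set(query)
    remaining = {name: set(members) for name, members in sets.items()}
    chosen: List[str] = []
    for _ in range(k):
        if not remaining or not uncovered:
            break
        best = max(remaining, key=lambda name: len(remaining[name] & uncovered))
        if not remaining[best] & uncovered:
            break  # no remaining set adds coverage
        chosen.append(best)
        uncovered -= remaining.pop(best)
    return chosen


def coverage(sets: Dict[str, Set[Hashable]], chosen: Iterable[str], query: Iterable[Hashable]) -> int:
    """Number of query items covered by the union of the chosen sets."""
    covered: Set[Hashable] = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered & set(query))
```

Each of the at most $k$ rounds scans every set once, which matches the $O(mn)$-style cost the text attributes to running greedy on the full, unsparsified input.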

From the above example, we identify two properties that we require of any system that solves 
this covering problem: 

1. Space Constrained: We need to (statically) preprocess the set system $(X, \mathcal{I})$ and store a
small sketch (much smaller than $O(mn)$), in the form of a data structure, and discard the
original representation of $(X, \mathcal{I})$. This can be thought of as a form of lossy compression. We
do not require the data structure to take any particular form; it need only be a sequence of
bits that allows us to extract information about the original set system $(X, \mathcal{I})$. For instance,
any statistical summary, a subgraph of the bipartite graph representing the set system, or
other representation is acceptable.

2. Dynamic: The query $Q$ is not known a priori, but arrives dynamically. More precisely: $Q$
arrives after the data structure is constructed and the original data discarded. It is at that
point that the data structure must be used to compute a solution $\mathcal{J}$ to the covering problem.

We call this covering problem (formalized in the next section) the Succinct Dynamic Covering 
(SDC) problem. Moreover, we call a solution to SDC a Coverage Oracle. A coverage oracle consists 
of a static stage that constructs a datastructure, and a dynamic stage that uses the datastructure 
to answer queries. 

Next we briefly present another, entirely different, Web application that also needs to confront
SDC. In addition, we note that there are several other applications facing similar covering problems,
including gene identification [8], searching domain-specific aggregator sites like Yelp$^{7}$, topical query
decomposition [1], and search-result diversification [4], [7].

Example 1.2. Online advertisers bid on (1) webpages matching relevancy criteria and (2) typically
target a certain user demographic. Advertisements are served based on a combination of the two
criteria above. When a user visits a particular webpage, there is usually no precise information
about the user's demographic, i.e., age, location, interests, gender, etc. Instead, there is a range
of possible values for each of these attributes, determined based on the search query the user issued
or session information. Ad-servers therefore attempt to pick a set of advertisements that would be
of interest to (i.e., "cover") a large number of users; the user demographic that needs to be covered
is determined by the page on which the advertisement is being placed, the user query, and session 
information. Therefore, ad-serving is faced with the SDC problem. The space constraint arises 
because the set system consisting of all webpages, and each user identified by the set of webpages 
visited by the user is prohibitively large to store in memory and process in real-time for every single 
page view. The dynamic aspect arises because each user view of each page is associated with a 
different user demographic that needs to be covered. 

6 The greedy algorithm for max k-cover, adapted to our problem, is simple: Find the set in $\mathcal{I}$ covering as many
uncovered items in $Q$ as possible, and repeat this $k$ times. This can clearly be implemented in $O(mn)$ time, and has
been shown to yield an $e/(e-1)$ approximation.
7 www.yelp.com
1.1 Contributions and Outline

Next we outline the main contributions of this paper. 

• In Section 2, we formally define the succinct dynamic covering (SDC) problem, and summarize
our results.

• In Section 3, we present a randomized coverage oracle for SDC. The oracle is presented as a
function of the available space, thus allowing us to trade off space for accuracy based on the
specific application. Unfortunately, the approximation ratio of this oracle degrades rapidly
as space decreases; however, the next section shows that this is in fact unavoidable.

• In Section 4, we present a lowerbound on the best possible approximation attainable as a
function of the space allowed for the datastructure. This lowerbound essentially matches the
upperbound of Section 3, though with the caveat that the lowerbound is for oracles that do
not use randomization. We expect the lowerbound to hold more generally for randomized
oracles, though we leave this as an open question.

Related work and future directions are presented in Section 5.

1.2 Related Work 

Our study of the tradeoff between space and approximation ratio is in the spirit of the work of
Thorup and Zwick [14] on distance oracles. They considered the problem of compressing a graph
$G$ into a small datastructure, in such a way that the datastructure can be used to approximately
answer queries for the distance between pairs of nodes in $G$. Similar to our results, they showed
matching upper and lower bounds on the space needed for compressing the graph subject to
preserving a certain approximation ratio. Moreover, similarly to our upperbounds for SDC, their
distance oracles benefit from a speedup at query time as approximation ratio is sacrificed for space.

Previous work has studied the set cover problem under streaming models. One model studied
in [3], [10] assumes that the sets are known in advance, only elements arrive online, and the
algorithms do not know in advance which subset of elements will arrive. An alternative model
assumes that elements are known in advance and sets arrive in a streaming fashion [13]. Our work
differs from these works in that SDC operates under a storage budget, so all sets cannot be stored; 
moreover, SDC needs to provide a good cover for all possible dynamic query inputs. 

Another related area is that of nearest neighbor search. It is easy to see that the SDC problem
with $k = 1$ corresponds to nearest neighbor search using the dot product similarity measure, i.e.,
$\mathrm{sim}_{dot}(x, y) = x \cdot y$. However, following from a result of Charikar [6], there exists no locality
sensitive hash function family for the dot product similarity function. Thus, there is no hope that
signature schemes (like minhashing for the Jaccard distance) can be used for SDC.
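To make the $k = 1$ correspondence concrete, the following small sketch encodes sets and the query as 0/1 indicator vectors, so that the coverage $|S \cap Q|$ equals the dot product of the two vectors. The helper names (`indicator`, `dot`, `best_single_set`) are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Set


def indicator(subset: Set[int], n: int) -> List[int]:
    """0/1 vector of length n for a subset of the ground set {1, ..., n}."""
    return [1 if i in subset else 0 for i in range(1, n + 1)]


def dot(x: List[int], y: List[int]) -> int:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def best_single_set(sets: Dict[str, Set[int]], query: Set[int], n: int) -> str:
    """For k = 1: the set whose indicator vector has the largest
    dot product with the query's indicator vector."""
    q = indicator(query, n)
    return max(sets, key=lambda name: dot(indicator(sets[name], n), q))
```

This brute-force scan touches the whole set system; it only illustrates the equivalence and is not a space-efficient oracle.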

2 SDC 

We start by defining the succinct dynamic covering (SDC) problem in Section 2.1. Then, in
Section 2.2, we summarize the main technical results achieved by this paper.
2.1 Problem Definition



We now formally define the SDC problem. 

Definition 2.1 (SDC). Given an offline input consisting of a set system $(X, \mathcal{I})$ with $n$ elements
(a.k.a. items) $X$ and $m$ sets $\mathcal{I}$, and an integer $k \ge 1$, devise a coverage oracle such that given a
dynamic query $Q \subseteq X$, the oracle finds a $\mathcal{J} \subseteq \mathcal{I}$ such that $|\mathcal{J}| \le k$ and
$|(\bigcup_{S \in \mathcal{J}} S) \cap Q|$ is as large as possible.

Definition 2.2 (Coverage Oracle). A Coverage Oracle for SDC consists of two stages:

1. Static Stage: Given integers $m, n, k$, and set system $(X, \mathcal{I})$ with $|X| = n$ and $|\mathcal{I}| = m$, build
a datastructure $\mathcal{D}$.

2. Dynamic Stage: Given a dynamic query $Q \subseteq X$, use $\mathcal{D}$ to return $\mathcal{J} \subseteq \mathcal{I}$ with $|\mathcal{J}| \le k$ as
a solution to SDC.

Note that our two constraints on a solution for SDC are illustrated by the two stages above.
(1) We are interested in building an offline data structure $\mathcal{D}$, and only use $\mathcal{D}$ to answer queries.
Typically, we want to maintain a small data structure, certainly $o(mn)$, and maybe as little as
$O((m + n)\,\mathrm{polylog}(mn))$ or even $O(m + n)$. Therefore, we cannot store the entire set system. (2)
Unlike the traditional max-coverage problem where the entire set of elements $X$ needs to be covered,
in SDC we are given queries dynamically. Therefore, we want a coverage oracle that returns good
solutions for all queries.

Given the space limitation of SDC, we cannot hope to exactly solve SDC (for all dynamic input 
queries). The goal of this paper is to explore approximate solutions for SDC, given a specific space 
constraint on the offline data structure $\mathcal{D}$. We define the approximation ratio of an oracle as the
worst-case, taken over all inputs, of the ratio between the coverage of $Q$ by the optimal solution and
the coverage of $Q$ by the output of the oracle. We allow the approximation ratio to be a function
of $n$, $m$, and $k$, and denote it by $\alpha(n, m, k)$.

More precisely, given a coverage oracle $A$, if on inputs $k, X, \mathcal{I}, Q$ (where implicitly $n = |X|$
and $m = |\mathcal{I}|$) the oracle $A$ returns $\mathcal{J} \subseteq \mathcal{I}$, we denote the size of the coverage as
$A(k, X, \mathcal{I}, Q) := |(\bigcup_{S \in \mathcal{J}} S) \cap Q|$. Similarly, we denote the coverage of the optimal solution by
$\mathrm{OPT}(k, X, \mathcal{I}, Q) := \max\{|(\bigcup_{S \in \mathcal{J}^*} S) \cap Q| : \mathcal{J}^* \subseteq \mathcal{I}, |\mathcal{J}^*| \le k\}$. We then express the
approximation ratio $\alpha(n, m, k)$ as follows.

$$\alpha(n, m, k) = \max_{X, \mathcal{I}, Q} \frac{\mathrm{OPT}(k, X, \mathcal{I}, Q)}{A(k, X, \mathcal{I}, Q)}$$

where the maximum above is taken over set systems $(X, \mathcal{I})$ with $|X| = n$ and $|\mathcal{I}| = m$, and
queries $Q \subseteq X$.

We will also be concerned with randomized coverage oracles. Note that, when we devise a
randomized coverage oracle, we use randomization only in the static stage, i.e., in the construction
of the datastructure. We then let the expected approximation ratio be the worst case expected
performance of the oracle as compared to the optimal solution.

$$\alpha(n, m, k) = \max_{X, \mathcal{I}, Q} \frac{\mathrm{OPT}(k, X, \mathcal{I}, Q)}{E[A(k, X, \mathcal{I}, Q)]} \qquad (1)$$
Table 1: Summary of results for SDC giving the approximation ratio, the space constraint on the
coverage oracle, and the nature of the bound: upper bound (UB) or lower bound (LB),
and deterministic (Det.) or randomized (Rand.).

Approximation Ratio | Storage | Bound
$O(\min(m/k, \sqrt{n/k}))$ | $O(n)$ | Det. UB
$O(\min(m^{\epsilon}/\sqrt{k}, \sqrt{n/k}))$ | $\tilde{O}(nm^{1-2\epsilon})$ | Rand. UB
(see Theorem 4.1) | (see Theorem 4.1) | Det. LB

The expectation in the above expression is over the random coins flipped by the static stage
of the oracle, and the maximization is over $X, \mathcal{I}, Q$ as before. We elaborate on this benchmark in
Section 3.

We study the space-approximation tradeoff; i.e., how the (expected) approximation ratio
improves as the amount of space allowed for $\mathcal{D}$ is increased. In our lowerbounds, we are not specifically
concerned with the time taken to compute the datastructure or answer queries. Therefore, our
lowerbounds are purely information-theoretic: we calculate the amount of information we are
required to store if we are to guarantee a specific approximation ratio, independent of computational
concerns. Our lowerbounds are particularly novel and striking in that they assume nothing about 
the datastructure, which may be an arbitrary sequence of bits. We establish our lowerbounds via 
a novel application of the probabilistic method that may be of independent interest. 

Even though we focus on space vs. approximation, and not on runtime, fortunately the coverage
oracles in our upperbounds can be implemented efficiently (both static and dynamic stage).
Moreover, using our upperbounds to trade approximation for space yields, as a side-effect, an
improvement in runtime when answering a query. In particular, observe that if no sparsification of
the data is done up-front, then answering each query using the standard greedy approximation
algorithm for max k-cover [11] takes $O(mn)$ time. Our oracles, presented in Section 3, spend
$O(mn)$ time up-front building a data structure of size $O(b)$, where $b$ is a parameter of the oracle
between $n$ and $nm$. In the dynamic stage, however, answering a query now takes $O(b)$ time, since we
use the greedy algorithm for max k-cover on a "sparse" set system. Therefore, the dynamic stage
becomes faster as we decrease the size of the data structure. In fact, this increase in speed is not
restricted to an algorithmic speedup as described above. It is likely that there will also be speedup 
due to architectural reasons, since a smaller amount of data needs to be kept in memory. Therefore, 
trading off approximation for space yields an incidental speedup in runtime which bodes well for 
the dynamic nature of the queries. 

2.2 Summary of results 

Table 1 summarizes the main results obtained in this paper for SDC input with $n$ elements, $m$ sets,
and integer $k \ge 1$. The lower bound in the table is for any nonnegative constants $\delta_1, \delta_2$ not both
0, and the randomized upperbound is parameterized by $\epsilon$ with $0 \le \epsilon \le 1/2$. The upper and lower
bounds are developed in Sections 3 and 4, respectively.
3 Upper Bounds



In this section, we show a coverage oracle that trades off space and approximation ratio. We
designate a tradeoff parameter $\epsilon$, where $0 \le \epsilon \le 1/2$. For any such $\epsilon$, we get an
$O(\min(m^{\epsilon}/\sqrt{k}, \sqrt{n/k}))$-approximate coverage oracle that stores $\tilde{O}(nm^{1-2\epsilon})$ bits. Therefore,
setting a small value of $\epsilon$ achieves a better approximation ratio, at the expense of storage space.
As is common practice, we use $\tilde{O}(\cdot)$ to denote suppressing polylogarithmic factors in $n$ and $m$;
this is reasonable when the guarantees are super-polylogarithmic, as is the case here.

The oracle we show is randomized, in the sense that the static stage flips some random coins. 
The datastructure constructed is a random variable in the internal coin flips of the static stage of 
the oracle. We measure the expected approximation ratio (a.k.a approximation ratio, when clear 
from context) of the oracle, as defined in Equation (1). For every fixed query $Q$ independent of
the random coins used in constructing the datastructure, this ratio is attained in expectation. In 
other words, our adversarial model is that of an oblivious adversary: someone trying to fool our 
oracle may choose any query they like, but their choice cannot depend on knowledge of the random 
choices made in constructing the datastructure. 

In Section 4, we will see that our oracle attains a space-approximation tradeoff that is essentially
optimal when compared with oracles that are deterministic. In other words, no deterministic oracle 
can do substantially better. We leave open the questions of whether a better randomized oracle is 
possible, and whether an equally good deterministic oracle exists. 

3.1 Main Result and Roadmap 

The following theorem states the main result of this section. 

Theorem 3.1. For every $\epsilon$ with $0 \le \epsilon \le 1/2$, there is a randomized coverage oracle for SDC that
achieves an $O(\min(m^{\epsilon}/\sqrt{k}, \sqrt{n/k}))$ approximation and stores $\tilde{O}(nm^{1-2\epsilon})$ bits.

The remainder of this section, leading up to the above result, is organized as follows. Before proving
Theorem 3.1, to build intuition we show in Section 3.2 (Remark 3.2) a much simpler deterministic
oracle, with a much weaker approximation guarantee. Then, we prove Theorem 3.1 in two parts.
First, in Section 3.3, we show a randomized coverage oracle that stores $\tilde{O}(nm^{1-2\epsilon})$ bits and achieves
an $O(m^{\epsilon}/\sqrt{k})$ approximation in expectation. Then, in Section 3.4, we show a deterministic oracle
that achieves an $O(\sqrt{n/k})$ approximation and stores $O(n)$ bits. Combining the two oracles into a
single oracle in the obvious way yields Theorem 3.1.

3.2 Simple Deterministic Oracle 

Remark 3.2. There is a simple deterministic oracle that attains an $m/k$ approximation with $O(n)$
space. The static stage proceeds as follows: Given set system $(X, \mathcal{I})$, for each $i \in X$ we "remember"
one set $S \in \mathcal{I}$ with $i \in S$ (breaking ties arbitrarily). In other words, for each $S \in \mathcal{I}$ we define
$\widetilde{S} \subseteq S$ such that $\{\widetilde{S} : S \in \mathcal{I}\}$ is a partition of $X$. We then store the "sparsified" set system
$(X, \widetilde{\mathcal{I}} = \{\widetilde{S} : S \in \mathcal{I}\})$. It is clear that this can be done in linear time by a trivial greedy algorithm.
Moreover, $(X, \widetilde{\mathcal{I}})$ can be stored in $O(n)$ space as an $n \times m$ bipartite graph with $n$ edges.
The dynamic stage is straightforward: when given a query $Q$, we simply return the indices of
the $k$ sets in $\widetilde{\mathcal{I}}$ that collectively cover as much of $Q$ as possible. It is clear that this gives an $m/k$
approximation: the at most $m$ parts of $\widetilde{\mathcal{I}}$ together contain all of $Q$, so the best $k$ of them cover at
least a $k/m$ fraction of $Q$. Moreover, since $\widetilde{\mathcal{I}}$ is a partition of $X$, it can be accomplished by a trivial
greedy algorithm in polynomial time.

Next we use randomization to show a much better, and much more involved, upperbound that 
trades off approximation and space. 

3.3 An $O(m^{\epsilon}/\sqrt{k})$ Approximation with $\tilde{O}(nm^{1-2\epsilon})$ Space

Consider the set system $(X, \mathcal{I})$, where $X$ is the set of items and $\mathcal{I}$ is the family of sets. We
assume without loss of generality that each item is in some set. We define a randomized oracle for building a
datastructure, which is a "sparsified" version of $(X, \mathcal{I})$. Namely, for every $S \in \mathcal{I}$ we define $\widetilde{S} \subseteq S$,
and store the set system $(X, \widetilde{\mathcal{I}} = \{\widetilde{S} : S \in \mathcal{I}\})$. We require that $(X, \widetilde{\mathcal{I}})$ can be stored in
$\tilde{O}(nm^{1-2\epsilon})$ space. We construct the datastructure in two stages, as follows.

• Label all items in $X$ "uncovered" and all sets in $\mathcal{I}$ "unchosen"

• Stage 1: While there exists an unchosen set $S \in \mathcal{I}$ containing at least $\frac{n}{m^{\epsilon}\sqrt{k}}$ uncovered items

— Let $\widetilde{S}$ be the set of uncovered items in $S$.

— Relabel all items in $\widetilde{S}$ as "covered" and "significant"

— Relabel $S$ as "chosen" and "significant"

• Stage 2: For every remaining "unchosen" set $S$

— Choose $nm^{-2\epsilon}$ "uncovered" items $\widetilde{S} \subseteq S$ uniformly at random from the uncovered items in
$S$ (if fewer than $nm^{-2\epsilon}$ such items, then let $\widetilde{S}$ be all of them).

— Relabel each item in $\widetilde{S}$ as "covered" and "insignificant"

— Relabel $S$ as "chosen" and "insignificant"

• Label every uncovered item as "uncovered" and "insignificant"
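The two-stage construction above can be sketched as follows. This is a hedged illustration rather than the paper's implementation, and it makes several assumptions that are not in the text: the function name `build_sparsified` is hypothetical; the Stage 2 budget $nm^{-2\epsilon}$ is rounded up to an integer; Stage 1 takes qualifying sets in dictionary order; and Stage 2 samples each set from the items left uncovered by Stage 1, which is the reading used in the proof of Lemma 3.4 (each item of an insignificant set is kept with probability at least $\sqrt{k}/m^{\epsilon}$).

```python
import math
import random
from typing import Dict, Set, Tuple


def build_sparsified(
    sets: Dict[str, Set[int]], n: int, eps: float, k: int, rng: random.Random
) -> Tuple[Dict[str, Set[int]], Set[str]]:
    """Return (sparsified sets, names of the significant sets)."""
    m = len(sets)
    threshold = n / (m ** eps * math.sqrt(k))        # Stage 1 cut-off
    budget = max(1, math.ceil(n * m ** (-2 * eps)))  # Stage 2 sample size (rounding assumed)

    uncovered: Set[int] = set().union(*sets.values())
    sparse: Dict[str, Set[int]] = {}
    significant: Set[str] = set()

    # Stage 1: one pass suffices, because uncovered counts only shrink over time.
    for name, members in sets.items():
        fresh = members & uncovered
        if len(fresh) >= threshold:
            sparse[name] = fresh
            uncovered -= fresh
            significant.add(name)

    # Stage 2: every remaining set keeps a uniform sample of its leftover items.
    leftover = set(uncovered)  # items not covered by any significant set
    for name, members in sets.items():
        if name in sparse:
            continue
        pool = sorted(members & leftover)
        picked = pool if len(pool) <= budget else rng.sample(pool, budget)
        sparse[name] = set(picked)
    return sparse, significant
```

Passing an explicit `random.Random` instance keeps the static stage reproducible, which is convenient because all randomness of the oracle lives in this stage.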

When presented with a query $Q \subseteq X$, we use the stored datastructure $(X, \widetilde{\mathcal{I}})$ in the obvious
way: namely, we find $\widetilde{S}_1, \ldots, \widetilde{S}_k \in \widetilde{\mathcal{I}}$ maximizing $|(\bigcup_{i=1}^{k} \widetilde{S}_i) \cap Q|$, and return the names of the
corresponding original sets $S_1, \ldots, S_k$. However, this problem cannot be solved exactly in polynomial
time in general. Nevertheless, we can instead use the greedy algorithm for max-k-cover to get a
constant-factor approximation [11]: this will not affect our asymptotic guarantee on the
approximation ratio. The following two lemmas complete the proof that the above oracle achieves an
$O(m^{\epsilon}/\sqrt{k})$ approximation with $\tilde{O}(nm^{1-2\epsilon})$ space.

Lemma 3.3. The datastructure $(X, \widetilde{\mathcal{I}})$ can be stored using $\tilde{O}(nm^{1-2\epsilon})$ bits.

Proof. We store the set system as a bipartite graph representing the containment relation between
items and sets. To show that the bipartite graph can be stored in the required space, it suffices
to show that $(X, \widetilde{\mathcal{I}})$ is "sparse"; namely, that the total number of edges $(x, \widetilde{S}) \in X \times \widetilde{\mathcal{I}}$ such that
$x \in \widetilde{S}$ is $O(nm^{1-2\epsilon})$. We account for the edges created in stages 1 and 2 separately.

1. Every significant item is connected to a single set. This creates at most $n$ edges.

2. For every insignificant set, we store at most $nm^{-2\epsilon}$ items. This creates at most
$m \cdot nm^{-2\epsilon} = nm^{1-2\epsilon}$ edges.

□

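Lemma 3.4. For every query $Q$, the oracle returns sets $S_1, \ldots, S_k$ such that

$$E\left[\left|\left(\bigcup_{i=1}^{k} S_i\right) \cap Q\right|\right] \ge \frac{\left|\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q\right|}{O(m^{\epsilon}/\sqrt{k})}$$

for any $S_1^*, \ldots, S_k^* \in \mathcal{I}$.

Note that $S_1, \ldots, S_k$ are random variables in the internal coin-flips of the static stage that
constructs the datastructure. The expectation in the statement of the lemma is over these random
coins.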

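Proof. We fix an optimal choice for $S_1^*, \ldots, S_k^* \in \mathcal{I}$, and denote $\mathrm{OPT} = |(\bigcup_{i=1}^{k} S_i^*) \cap Q|$. Since,
by construction, $\widetilde{S} \subseteq S$ for all $S \in \mathcal{I}$, it suffices to show that the output of the oracle satisfies
$|(\bigcup_{i=1}^{k} \widetilde{S}_i) \cap Q| \ge \mathrm{OPT} / O(m^{\epsilon}/\sqrt{k})$ in expectation. Moreover, since the dynamic stage algorithm finds a
constant factor approximation to $\max\{|(\bigcup_{i=1}^{k} \widetilde{S}_i) \cap Q| : \widetilde{S}_1, \ldots, \widetilde{S}_k \in \widetilde{\mathcal{I}}\}$, it is sufficient to show
that there exist $\widetilde{S}_1, \ldots, \widetilde{S}_k \in \widetilde{\mathcal{I}}$ with $E[|(\bigcup_{i=1}^{k} \widetilde{S}_i) \cap Q|] \ge \Omega\left(\frac{\sqrt{k}}{m^{\epsilon}}\right) \cdot \mathrm{OPT}$.

We distinguish two cases, based on whether most of the items $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ covered by the
optimal solution are in significant or insignificant sets. We use the "significant" and "insignificant"
designation as used in the static stage algorithm. Moreover, we refer to $\widetilde{S} \in \widetilde{\mathcal{I}}$ as significant
(insignificant, resp.) when the corresponding $S \in \mathcal{I}$ is significant (insignificant, resp.).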

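1. At least half of $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are significant items: Notice that, by construction, there
are at most $m^{\epsilon}\sqrt{k}$ significant sets in $\mathcal{I}$. Moreover, the significant items are precisely those
covered by the significant sets of $\widetilde{\mathcal{I}}$, and those sets form a partition of the significant items.
Therefore, by the pigeonhole principle there are some $\widetilde{S}_1, \ldots, \widetilde{S}_k \in \widetilde{\mathcal{I}}$ such that
$\bigcup_{i=1}^{k} \widetilde{S}_i$ contains at least a $\frac{k}{m^{\epsilon}\sqrt{k}} = \frac{\sqrt{k}}{m^{\epsilon}}$ fraction of the significant items in
$(\bigcup_{i=1}^{k} S_i^*) \cap Q$. This gives the desired $O(m^{\epsilon}/\sqrt{k})$ approximation.

2. At least half of $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are insignificant items: In this case, at least half the items
$(\bigcup_{i=1}^{k} S_i^*) \cap Q$ covered by the optimal solution are contained in the insignificant members of
$\{S_1^*, \ldots, S_k^*\}$. Recall that any insignificant set in $\mathcal{I}$ contains at most $\frac{n}{m^{\epsilon}\sqrt{k}}$ insignificant items.
Therefore, the algorithm includes each element of an insignificant $S_i^*$ in $\widetilde{S}_i^*$ with probability
at least $nm^{-2\epsilon} \big/ \frac{n}{m^{\epsilon}\sqrt{k}}$, which is at least $\sqrt{k}/m^{\epsilon}$. Thus, every insignificant item in
$(\bigcup_{i=1}^{k} S_i^*)$ is in $(\bigcup_{i=1}^{k} \widetilde{S}_i^*)$ with probability at least $\sqrt{k}/m^{\epsilon}$. This gives that the expected
size of $(\bigcup_{i=1}^{k} \widetilde{S}_i^*) \cap Q$ is at least $\Omega\left(\frac{\sqrt{k}}{m^{\epsilon}}\right) \cdot \mathrm{OPT}$. Taking $\widetilde{S}_i = \widetilde{S}_i^*$ completes the proof.

□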
3.4 An $O(\sqrt{n/k})$ Approximation with $O(n)$ Space

This coverage oracle is similar to the one in the previous section, though it is much simpler. Moreover,
it is deterministic. Indeed, we construct the datastructure by the following greedy algorithm that
resembles the greedy algorithm for max-k-cover.

• Label all items in $X$ "uncovered" and all sets in $\mathcal{I}$ "unchosen"

• While there are unchosen sets

— Find the unchosen set $S \in \mathcal{I}$ containing the most uncovered items

— Let $\widetilde{S}$ be the set of uncovered items in $S$.

— Relabel all items in $\widetilde{S}$ as "covered"

— Relabel $S$ as "chosen"
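The greedy static stage listed above, together with the matching dynamic stage, might look like this in Python. The names `build_greedy_partition` and `top_k_parts` are hypothetical, set names are assumed to be strings, and ties are broken alphabetically purely to make the sketch deterministic.

```python
from typing import Dict, Iterable, List, Set


def build_greedy_partition(sets: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Static stage: repeatedly choose the unchosen set with the most
    uncovered items and keep only those items for it."""
    uncovered: Set[int] = set().union(*sets.values())
    unchosen = set(sets)
    parts: Dict[str, Set[int]] = {}
    while unchosen:
        # sorted() fixes the tie-breaking order; max() keeps the first maximum.
        best = max(sorted(unchosen), key=lambda name: len(sets[name] & uncovered))
        parts[best] = sets[best] & uncovered
        uncovered -= parts[best]
        unchosen.remove(best)
    return parts


def top_k_parts(parts: Dict[str, Set[int]], query: Iterable[int], k: int) -> List[str]:
    """Dynamic stage: the parts are disjoint, so take the k parts that
    overlap the query the most."""
    q = set(query)
    return sorted(parts, key=lambda name: len(parts[name] & q), reverse=True)[:k]
```

The only difference from the simpler oracle of Section 3.2 is the order in which sets claim items: here the largest remaining set always goes first.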

Observe that $\widetilde{\mathcal{I}}$ is a partition of $X$. When presented with a query $Q \subseteq X$, we use the
datastructure $(X, \widetilde{\mathcal{I}} = \{\widetilde{S} : S \in \mathcal{I}\})$ in the obvious way. Namely, we find the sets
$\widetilde{S}_1, \ldots, \widetilde{S}_k \in \widetilde{\mathcal{I}}$ maximizing $|(\bigcup_{i=1}^{k} \widetilde{S}_i) \cap Q|$, and output the corresponding non-sparse sets
$S_1, \ldots, S_k$. This can easily be done in polynomial time by using the obvious greedy algorithm,
since $\widetilde{\mathcal{I}}$ is a partition of $X$.

Note that the oracle described above is very similar to the oracle from Section 3.2. The dynamic
stage is identical. The static stage, however, needs to build the partition using a specific greedy
ordering, as opposed to the arbitrary ordering used in Section 3.2. The following two lemmas
complete the proof that the oracle achieves an $O(\sqrt{n/k})$ approximation with $\tilde{O}(n)$ space.

Lemma 3.5. The datastructure $(X, \widetilde{\mathcal{I}})$ can be stored using $O(n)$ bits.

Proof. Observe that each item is contained in exactly one $\widetilde{S} \in \widetilde{\mathcal{I}}$. Therefore, the bipartite graph
representing the set system $(X, \widetilde{\mathcal{I}})$ has at most $n$ edges. This establishes the lemma. □

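Lemma 3.6. For every query $Q$, the oracle returns sets $S_1, \ldots, S_k$ with

$$\left|\left(\bigcup_{i=1}^{k} S_i\right) \cap Q\right| \ge \frac{\left|\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q\right|}{O(\sqrt{n/k})}$$

for any $S_1^*, \ldots, S_k^* \in \mathcal{I}$.

Proof. Fix an optimal choice of $S_1^*, \ldots, S_k^*$, and denote $\mathrm{OPT} = |(\bigcup_{i=1}^{k} S_i^*) \cap Q|$. Recall that the
oracle finds $\widetilde{S}_1, \ldots, \widetilde{S}_k \in \widetilde{\mathcal{I}}$ maximizing $|(\bigcup_{i=1}^{k} \widetilde{S}_i) \cap Q|$, and then outputs the corresponding original
sets $S_1, \ldots, S_k$. Therefore, it suffices to show that there are some $\widetilde{S}_1, \ldots, \widetilde{S}_k \in \widetilde{\mathcal{I}}$ with
$|(\bigcup_{i=1}^{k} \widetilde{S}_i) \cap Q| \ge \mathrm{OPT} / O(\sqrt{n/k})$.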

We distinguish two cases, based on whether most of $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are in big or small sets in $\widetilde{\mathcal{I}}$.
Recall that $\widetilde{\mathcal{I}}$ forms a partition of $X$. We say $\widetilde{S} \in \widetilde{\mathcal{I}}$ is "significant" if $|\widetilde{S}| > \sqrt{n/k}$, otherwise $\widetilde{S}$
is "insignificant". Similarly, we say an item $i \in X$ is "significant" if it falls in a significant set in $\widetilde{\mathcal{I}}$,
otherwise it is "insignificant". Notice that there are at most $\frac{n}{\sqrt{n/k}} = \sqrt{nk}$ significant sets.

First, we consider the case where at least half the items in $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are significant. Since
there are at most $\sqrt{nk}$ significant sets in $\widetilde{\mathcal{I}}$, by the pigeonhole principle there are $k$ of them that
collectively cover a $k/\sqrt{nk} = \sqrt{k/n}$ fraction of all significant items in $(\bigcup_{i=1}^{k} S_i^*) \cap Q$. This would
guarantee the $O(\sqrt{n/k})$ approximation, as needed.

Next, we consider the case where at least half of $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are insignificant. By examining
the greedy algorithm of the static stage, it is easy to see that each $S \in \mathcal{I}$ contains at most $\sqrt{n/k}$
insignificant items. Therefore, there are at most $k \cdot \sqrt{n/k} = \sqrt{nk}$ insignificant items in $(\bigcup_{i=1}^{k} S_i^*)$.
Therefore we deduce that $\mathrm{OPT} = |(\bigcup_{i=1}^{k} S_i^*) \cap Q| \le 2\sqrt{kn}$. Since the optimal covers $O(\sqrt{kn})$ items in
$Q$, it suffices for an $O(\sqrt{n/k})$ approximation to show that there are $\widetilde{S}_1, \ldots, \widetilde{S}_k \in \widetilde{\mathcal{I}}$ that collectively
cover $k$ items of $Q$. It is easy to see that this is indeed the case, since $\widetilde{\mathcal{I}}$ is a partition of $X$. This
completes the proof. □

4 Lower Bounds 

This section develops lower bounds for the SDC problem. We consider deterministic oracles that
store a datastructure of size $b(n, m, k)$ for set systems with $n$ items, $m$ sets, and maximum number
of allowed sets $k$. Moreover, we assume that $n \le b(n, m, k) \le nm$, since no nontrivial positive
result is possible when $b(n, m, k) = o(n)$, and a perfect approximation ratio of 1 is possible when
$b(n, m, k) = \Omega(nm)$.

4.1 Main Result and Roadmap 

The main result of this section is stated in the following theorem, which says that our randomized 
oracle in the previous section achieves a space-approximation tradeoff that essentially matches the 
best possible for any deterministic oracle. 

Theorem 4.1. Consider any deterministic oracle that stores a datastructure of size at most
$b(n, m, k)$ bits, where $n \le b(n, m, k) \le nm$. Let $\epsilon(n, m, k)$ be such that $b(n, m, k) = nm^{1-2\epsilon(n,m,k)}$.
When m € ( n ' m '^ < ^/n, the oracle does not attain an approximation ratio of 0( m ( fc '^ > — ) f or an V 
constant 5 > 0. Moreover, when yfn < m e ^ n,m,k ^ the oracle does not attain an approximation ratio 

The proof of the theorem above is somewhat involved. Therefore, to simplify the presentation
we prove in Section 4.2 a slight simplification of Theorem 4.1 that captures all the main ideas:
Our simplification sets $k = 1$, and proves the $O(m^{\epsilon(n,m,k)-\delta})$ approximation ratio, for
$m^{\epsilon(n,m,k)} \le \sqrt{n}$. Then, in Section 4.3, we prove the approximation ratio for the case of
$\sqrt{n} < m^{\epsilon(n,m,k)}$, still maintaining $k = 1$. Finally, in Section 4.4, we demonstrate how to modify
our proofs for any $k$, yielding Theorem 4.1.

We fix $\delta > 0$. For the remainder of the section, we use $b$ and $\epsilon$ as shorthand for $b(n,m,k)$ and 
$\epsilon(n,m,k)$, respectively. We let $\alpha(n,m,k)$ be the approximation ratio of the oracle, and use $\alpha$ as 
shorthand. Observe that $0 \le \epsilon \le 1/2$. 
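
As a quick sanity check on the parametrization $b = nm^{1-2\epsilon}$, the following minimal sketch inverts it to recover $\epsilon$ from a space budget. The helper name `epsilon_from_space` and the sample numbers in the note below are illustrative assumptions, not part of the paper.

```python
import math


def epsilon_from_space(n: int, m: int, b: float) -> float:
    """Solve b = n * m**(1 - 2*eps) for eps (hypothetical helper).

    Assumes m > 1 and n <= b <= n*m, the range considered in this section.
    """
    if m <= 1:
        raise ValueError("expected m > 1")
    if not (n <= b <= n * m):
        raise ValueError("expected n <= b <= n*m")
    # log_m(b / n) = 1 - 2*eps, hence eps = (1 - log_m(b / n)) / 2
    return 0.5 * (1.0 - math.log(b / n) / math.log(m))
```

For instance, with $n = 10^6$ and $m = 10^4$, a budget of $b = n$ corresponds to $\epsilon = 1/2$ and a budget of $b = nm$ to $\epsilon = 0$, matching the two ends of the range $0 \le \epsilon \le 1/2$.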

4.2 Proof of a Simpler Lower Bound 

We simplify Theorem 4.1 by assuming $k = 1$ and $m^{\epsilon} \le \sqrt{n}$. The result is the following proposition, 
stated using the shorthand notation described above. 

Proposition 4.2. Fix $k = 1$ and parameter $\epsilon$ with $0 \le \epsilon \le 1/2$. Assume $m^{\epsilon} \le \sqrt{n}$. Consider any 
deterministic oracle that stores a datastructure of size at most $b = nm^{1-2\epsilon}$ bits. The oracle does 
not attain an approximation ratio of $O(m^{\epsilon-\delta})$ for any constant $\delta > 0$. 

We assume the approximation ratio $\alpha$ attained by the oracle is $O(m^{\epsilon-\delta})$ and derive a 
contradiction. The proof uses the probabilistic method (see [2]). We begin by defining a distribution on set 
systems, and then go on to show that this distribution "fools" a small coverage oracle with positive 
probability. 

4.2.1 Defining a Distribution D on Set Systems 

We will show that there is a set system $(X, \mathcal{I})$ and a query $Q$ that forces the algorithm to output 
a set $S \in \mathcal{I}$ that is not within a factor $\alpha$ of optimal. We use the probabilistic method. Namely, we 
exhibit a distribution $D$ over set systems $(X, \mathcal{I})$ such that, for every deterministic oracle storing 
a datastructure of size $b$, there exists with non-zero probability a query $Q$ for which the oracle 
outputs a set whose approximation ratio is worse than $\alpha$. To show this, we draw two set systems i.i.d. from 
$D$, and show that with non-zero probability both of the following hold: the two set systems are not 
distinguished by the coverage oracle, and moreover there exists a query $Q$ that requires the 
algorithm to return different answers for the two set systems in order to achieve an $O(m^{\epsilon-\delta})$ approximation. 

We define $D$ as follows. Given the ground set $X = \{1, \ldots, n\}$, we let $\mathcal{I} = \{A_i\}_{i=1}^{m}$ and draw 
$A_1, \ldots, A_m$ i.i.d. as follows: we let $A_i$ be a subset of $X$ of size $nm^{-\epsilon}$ drawn uniformly at random. 
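
To make the distribution concrete, here is a minimal sketch of drawing one set system from $D$. The function name `sample_set_system`, the optional `rng` argument, and the rounding of $nm^{-\epsilon}$ to an integer are assumptions made for illustration; the paper treats $nm^{-\epsilon}$ as an integer throughout.

```python
import random


def sample_set_system(n, m, eps, rng=None):
    """Draw m sets i.i.d., each a uniformly random subset of {1..n}.

    Each set has size round(n * m**(-eps)); the rounding is an assumption
    of this sketch, since the text treats the size as an integer.
    """
    rng = rng or random.Random()
    size = max(1, round(n * m ** (-eps)))
    ground = range(1, n + 1)
    # random.sample draws `size` distinct elements uniformly at random.
    return [frozenset(rng.sample(ground, size)) for _ in range(m)]
```

Calling `sample_set_system` twice with independent randomness gives the two i.i.d. systems used in the rest of the argument.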

4.2.2 Sampling twice from D and collisions 

Next, we draw two set systems $(X, \mathcal{I} = \{A_i\}_{i=1}^{m})$ and $(X, \mathcal{I}' = \{A'_i\}_{i=1}^{m})$ i.i.d. from $D$, as discussed 
above. First, we lower-bound the probability that $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ are not distinguished by the 
coverage oracle. We call such an occurrence a "collision". 

Lemma 4.3. The probability that the same datastructure is stored for $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ is at least 
$2^{-b}$. 

Proof. There are $2^b$ possible datastructures. Let $p_i$ denote the probability that, when presented 
with random $(X, \mathcal{I}) \sim D$, the oracle stores the $i$'th datastructure. We can write the probability 
of "collision" of the two i.i.d. samples $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ as $\sum_{i=1}^{2^b} p_i^2$. However, since $\sum_i p_i = 1$, 
this expression is minimized when $p_i = 2^{-b}$ for all $i$. Plugging into the above expression gives a 
lower bound of $2^{-b}$, as required. □ 
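
The minimization claim in the proof can be checked numerically. The sketch below computes the collision probability $\sum_i p_i^2$ for a distribution over $N = 2^b$ datastructures and compares it with $1/N$; the helper names and the way test distributions are generated are illustrative assumptions.

```python
import random


def collision_probability(p):
    """Probability that two i.i.d. draws from distribution p coincide."""
    return sum(x * x for x in p)


def random_distribution(num_outcomes, rng):
    """An arbitrary probability vector, used only to probe the bound."""
    weights = [rng.random() for _ in range(num_outcomes)]
    total = sum(weights)
    return [w / total for w in weights]
```

For the uniform distribution on $N$ outcomes the collision probability is exactly $1/N$, and any other distribution on the same outcomes gives a value at least as large, which is the content of Lemma 4.3 with $N = 2^b$.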

4.2.3 Fooling Queries and Candidates 

Next, we lower-bound the probability that a query $Q$ exists requiring two different answers for $(X, \mathcal{I})$ 
and $(X, \mathcal{I}')$ in order to get the desired $\alpha = O(m^{\epsilon-\delta})$ approximation. We call such a query $Q$ a 
fooling query. We define a set of queries that are "candidates" for being a fooling query: a set 
$Q \subseteq X$ is called a candidate query if $Q = A_i \cup A'_{i'}$ for some $i \ne i'$. In other words, a query is a 
candidate if it is the union of a set from $(X, \mathcal{I})$ and a set from $(X, \mathcal{I}')$ with different indices. 

Ideally, candidate $Q = A_i \cup A'_{i'}$ would be a fooling query by forcing the oracle to output $i$ for 
$(X, \mathcal{I})$ and $i'$ for $(X, \mathcal{I}')$ in order to guarantee the desired approximation. However, this need not 
be the case: consider for instance the case when, for some $j \ne i, i'$, both $A_j$ and $A'_j$ have large 
intersection with $Q$, making it acceptable to output $j$ for both. We will show that the probability that none 
of the candidate queries is a fooling query is strictly less than $2^{-b}$ when $n$ and $m$ are sufficiently 
large. Doing so would complete the proof: collision occurs with probability $\ge 2^{-b}$, and a fooling 
query exists with probability $> 1 - 2^{-b}$, and therefore both occur simultaneously with positive 
probability. This would yield the desired contradiction. 
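
The notions of candidate and fooling query for $k = 1$ can be sketched in code. This is an illustrative reading rather than the paper's formal definition: it assumes that an oracle storing the same datastructure for both systems must answer with the same index $j$, and calls $Q$ fooling when no single index is an $\alpha$-approximation in both systems. The names `coverage`, `candidate_queries`, and `is_fooling` are hypothetical.

```python
def coverage(sets, query):
    """Number of query items covered by each set of a system."""
    return [len(s & query) for s in sets]


def candidate_queries(system_a, system_b):
    """Yield (i, i2, Q) with Q = A_i united with A'_{i2}, for i != i2."""
    for i, a in enumerate(system_a):
        for i2, b in enumerate(system_b):
            if i != i2:
                yield i, i2, a | b


def is_fooling(query, system_a, system_b, alpha):
    """True if no single index is an alpha-approximation in both systems."""
    cov_a = coverage(system_a, query)
    cov_b = coverage(system_b, query)
    opt_a, opt_b = max(cov_a), max(cov_b)
    for j in range(len(system_a)):
        # Index j is acceptable if it covers at least OPT/alpha in each system.
        if cov_a[j] * alpha >= opt_a and cov_b[j] * alpha >= opt_b:
            return False
    return True
```

The second branch of the discussion above corresponds to `is_fooling` returning `False` because some index $j \ne i, i'$ has large intersection with $Q$ in both systems.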

4.2.4 The Probability that None of the Candidates is Fooling is Small 

We now upper-bound the probability that none of the candidates is a fooling query. Observe that 
if candidate $Q = A_i \cup A'_{i'}$ is not a fooling query, then there exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A'_{i'}\}$ with 
$|A \cap Q| \ge nm^{-\epsilon}/\alpha$. Therefore one of the following must be true: 

1. There exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A'_{i'}\}$ with $|A \cap A_i| \ge nm^{-\epsilon}/2\alpha = \Omega(nm^{-2\epsilon+\delta})$. 

2. There exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A'_{i'}\}$ with $|A \cap A'_{i'}| \ge nm^{-\epsilon}/2\alpha = \Omega(nm^{-2\epsilon+\delta})$. 

Therefore, if none of the candidates were fooling queries, then there are many "pairs" of sets in 
$\mathcal{I} \cup \mathcal{I}'$ that have an intersection substantially larger than the expected size of $nm^{-2\epsilon}$. This seems 
very unlikely. Indeed, the remainder of this proof will demonstrate just that. 

If none of the candidates are fooling queries, then by examining (1) and (2) above we deduce 
the following. There exists (see footnote 8) a set of pairs $P \subseteq (\mathcal{I} \cup \mathcal{I}') \times (\mathcal{I} \cup \mathcal{I}')$ such that: 

1. $|P| \ge m - 2 = \Omega(m)$. 

2. The undirected graph with nodes $\mathcal{I} \cup \mathcal{I}'$ and edges $P$ is bipartite. Moreover, every node in 
the left part has degree at most 1. Thus $P$ is acyclic. 

3. If $(B, C) \in P$ then $|B \cap C| \ge \Omega(nm^{-2\epsilon+\delta})$. 

We now proceed to bound the probability of existence of such a $P$, and in the process also 
bound the probability that none of the candidate queries are fooling. Recall that members of 
$\mathcal{I} \cup \mathcal{I}'$ are drawn i.i.d. from the uniform distribution on subsets of $X$ of size $nm^{-\epsilon}$. For every pair 
$B, C \in \mathcal{I} \cup \mathcal{I}'$, we let $\mathcal{R}(B,C) = |B \cap C|$ denote the size of their intersection. It is easy to see 
that the random variables $\{\mathcal{R}(B,C)\}_{B,C \in \mathcal{I} \cup \mathcal{I}'}$ are pairwise independent. Therefore, the variables 
indexed by any acyclic set of pairs are mutually independent, by basic probability theory. Thus, if we fix a 
particular $P$ satisfying (1) and (2), the probability that $P$ satisfies condition (3) is at most 

$\prod_{(B,C) \in P} \Pr[\mathcal{R}(B,C) \ge \Omega(nm^{-2\epsilon+\delta})]$ 

We now want to estimate the probability that the intersection of $B$ and $C$ is a factor $\Omega(m^{\delta})$ 
more than its expectation of $nm^{-2\epsilon}$. Therefore, we consider an indicator random variable $Y_i$ for 
each $i \in X$, designating whether $i \in B \cap C$. If the $Y_i$ were independent, we could use Chernoff bounds 

8 Consider constructing $P$ as follows: For candidate query $Q = A_1 \cup A'_2$, find the set in $(\mathcal{I} \cup \mathcal{I}') \setminus \{A_1, A'_2\}$ with a 
large intersection with one of $A_1$ or $A'_2$ as in (1) or (2). Say for instance we find that $A_7$ has a large intersection 
with $A_1$. We include $(A_1, A_7)$ in $P$, mark both $A_1$ and $A_7$ as "touched", and designate $A_1$ a "left" node and $A_7$ a 
"right" node. Then, we repeat the process with some candidate $Q' = A_i \cup A'_{i'}$ for some "untouched" $A_i$ and $A'_{i'}$. We 
keep repeating until there are no such candidates. Throughout this greedy process, we mark at most two members of 
$\mathcal{I} \cup \mathcal{I}'$ as "touched" for every pair we include in $P$. Note that some $A_i$ may be "touched" more than once. As long 
as there are at least 2 untouched sets in each of $\mathcal{I}$ and $\mathcal{I}'$, the algorithm may continue. 
to bound the probability that $\mathcal{R}(B,C)$ is large. Fortunately, it is easy to see that the $Y_i$'s are 
negatively correlated: i.e., for any $L \subseteq \{1, \ldots, n\}$, we have $\Pr[\bigwedge_{i \in L} Y_i = 1] \le \prod_{i \in L} \Pr[Y_i = 1] = m^{-2\epsilon|L|}$. 
Therefore, by the result of [12], if we "pretend" that they are independent by approximating their 
joint distribution by i.i.d. Bernoulli random variables, we can still use Chernoff bounds to bound 
the upper-tail probability. Therefore, using Chernoff bounds (see footnote 9), we deduce that the probability that 
the intersection of $B$ and $C$ is a factor $\Omega(m^{\delta})$ more than the expectation of $nm^{-2\epsilon}$ is at most 
$2^{-(\Omega(m^{\delta})-1)nm^{-2\epsilon}} \le 2^{-\Omega(nm^{-2\epsilon+\delta})}$. Therefore, the probability that the fixed $P$ satisfies condition (3) 
is at most 

$\prod_{(B,C) \in P} 2^{-\Omega(nm^{-2\epsilon+\delta})}$ 

$\le 2^{-\Omega(nm^{1-2\epsilon+\delta})}$ 
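
The tail bound on a single intersection can be checked against the exact distribution. For two independent uniformly random subsets of size $s$ of an $n$-element ground set, $|B \cap C|$ is hypergeometric with mean $s^2/n$. The sketch below computes the exact upper tail and the bound $2^{-\lambda\mu}$ from footnote 9; the function names and the concrete parameters in the note that follows are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def intersection_tail(n, s, t):
    """Exact Pr[|B & C| >= t] for independent uniform s-subsets B, C of an n-set."""
    total = comb(n, s)
    # Condition on B: C hits B in exactly j items in comb(s, j) * comb(n - s, s - j) ways.
    hits = sum(comb(s, j) * comb(n - s, s - j) for j in range(t, s + 1))
    return Fraction(hits, total)


def chernoff_bound(mu, lam):
    """The bound 2^(-lam * mu) of footnote 9, stated there for lam > 2e - 1."""
    return 2.0 ** (-lam * mu)
```

With $n = 400$ and $s = 40$ the mean is $\mu = 4$; taking $\lambda = 5$, the event $|B \cap C| > (1+\lambda)\mu = 24$ is `intersection_tail(400, 40, 25)`, which can be compared directly with `chernoff_bound(4.0, 5)`.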

Now, we can sum over all possible choices for $P$ satisfying (1) and (2) to get a bound on the 
existence of a $P$ satisfying (1), (2) and (3). It is easy to see that there are at most $m^m$ choices for 
$P$ that satisfy (1) and (2). Using the union bound, we get the following bound on the existence of 
such a $P$. 

$m^m \cdot 2^{-\Omega(nm^{1-2\epsilon+\delta})} \le 2^{m \log m - \Omega(nm^{1-2\epsilon+\delta})}$ 

$\le 2^{-\Omega(nm^{1-2\epsilon+\delta})}$ 

where the last inequality follows by simple algebraic manipulation from our assumption that 
$m^{\epsilon} \le \sqrt{n}$ and $\delta > 0$, when $n$ and $m$ are sufficiently large. Recall that, by our previous discussion, 
this expression also upper-bounds the probability that none of the candidate queries are fooling 
queries. But, when $n$ and $m$ are sufficiently large, this is strictly smaller than $2^{-b} = 2^{-nm^{1-2\epsilon}}$. 
Thus, by our previous discussion, this completes the proof of Proposition 4.2. 
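
The algebra behind the last two steps can be spelled out as follows. This is a sketch of the omitted manipulation, in which the constant $c > 0$ hidden in the $\Omega(\cdot)$ is named only for illustration.

```latex
\begin{align*}
m^{\epsilon} \le \sqrt{n}
  &\;\Longrightarrow\; n m^{-2\epsilon} \ge 1
   \;\Longrightarrow\; n m^{1-2\epsilon+\delta} \ge m^{1+\delta}, \\
m \log m - c\, n m^{1-2\epsilon+\delta}
  &\le -\tfrac{c}{2}\, n m^{1-2\epsilon+\delta}
  \qquad \text{once } m^{\delta} \ge \tfrac{2}{c} \log m, \\
\tfrac{c}{2}\, n m^{1-2\epsilon+\delta}
  &= \tfrac{c}{2}\, m^{\delta}\, b \;>\; b
  \qquad \text{once } m^{\delta} > \tfrac{2}{c}, \\
\Pr[\text{collision and a fooling query exists}]
  &\ge \Pr[\text{collision}] - \Pr[\text{no candidate is fooling}]
  \;>\; 2^{-b} - 2^{-b} = 0 .
\end{align*}
```

The first line is where the assumption $m^{\epsilon} \le \sqrt{n}$ enters: it guarantees that the exponent $nm^{1-2\epsilon+\delta}$ grows at least as fast as $m^{1+\delta}$, which dominates the $m \log m$ term contributed by the union bound.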



4.3 Modifying the proof for the case $\sqrt{n} < m^{\epsilon}$ 

We maintain the assumption that $k = 1$, and show how to modify the proof of Proposition 4.2 for 
the case when $\sqrt{n} < m^{\epsilon}$. 

Proposition 4.4. Fix $k = 1$ and parameter $\epsilon$ with $0 \le \epsilon \le 1/2$. Assume $\sqrt{n} < m^{\epsilon}$. Consider any 
deterministic oracle that stores a datastructure of size at most $b = nm^{1-2\epsilon}$ bits. The oracle does 
not attain an approximation ratio of $O(n^{1/2-\delta})$ for any constant $\delta > 0$. 

Instead of replicating almost the entire proof of Proposition 4.2, we point out the key 
changes necessary to yield a proof of Proposition 4.4, and leave the rest as an easy exercise for the reader. 

The proof proceeds almost identically to the proof of Proposition 4.2, with the following main 
changes: 

• Modifications to Section 4.2.1: When defining $D$, we let each $A_i$ be a subset of $X$ of size 
$\sqrt{n}$ instead of $nm^{-\epsilon}$. 

• We perform similar calculations throughout, accommodating the above modification to the size 
of $A_i$. 

9 We use the following version of the Chernoff bound: Let $X_1, \ldots, X_n$ be independent Bernoulli random variables, 
and let $X = \sum_i X_i$. If $E[X] = \mu$ and $\lambda > 2e - 1$, then $\Pr[X > (1 + \lambda)\mu] < 2^{-\lambda\mu}$. 
• Modifications to Section 4.2.4: We eventually arrive at an upper bound of $2^{-\Omega(mn^{\delta})}$ on the 
probability that none of the candidate queries are fooling. Using the assumption $m^{\epsilon} > \sqrt{n}$ 
and the fact that $b = nm^{1-2\epsilon}$, a simple algebraic manipulation shows that this bound is strictly 
less than $2^{-b}$. This completes the proof, as before. 

4.4 Modifying the proof for arbitrary k 

In this section, we generalize Proposition 4.2 to arbitrary $k$. The generalization of Proposition 4.4 
to arbitrary $k$ is essentially identical, and therefore we leave it as an exercise for the reader. We 
now state the generalization of Proposition 4.2 to arbitrary $k$. 

Proposition 4.5. Let parameter $\epsilon$ be such that $0 \le \epsilon \le 1/2$. Assume $m^{\epsilon} \le \sqrt{n}$. Consider any 
deterministic oracle that stores a datastructure of size at most $b = nm^{1-2\epsilon}$ bits. The oracle does 
not attain an approximation ratio of $O(m^{\epsilon-\delta}/(k\sqrt{k}))$ for any constant $\delta > 0$. 

The proof of Proposition 4.5 follows the outline of the proof of Proposition 4.2. The necessary 
modifications to the proof of Proposition 4.2 are as follows: 

• Modifications to Section 4.2.1: We define distribution $D$ as before, except that we let 
each $A_i$ be a subset of $X$ of size nm 

• Modifications to Section 4.2.2: Instead of sampling from $D$ twice, we sample $2k + 1$ times 
to get set systems $(X, \mathcal{I}^1), (X, \mathcal{I}^2), \ldots, (X, \mathcal{I}^{2k+1})$. This changes the probability of collision 
of Lemma 4.3 to $2^{-2kb}$. Here, collision means that all $2k + 1$ samples from $D$ are stored as 
the same datastructure by the static stage of the oracle. 

• Modifications to Section 4.2.3: We now define a fooling query analogously for general $k$: 
a query $Q$ is fooling if there is no single index $i$ such that returning the $i$'th set gives a good 
approximation for all the set systems $(X, \mathcal{I}^1), \ldots, (X, \mathcal{I}^{2k+1})$. 

Moreover, we analogously define candidate queries: We use $A^a_b$ to denote the $b$'th set in set 
system $(X, \mathcal{I}^a)$. We say $Q \subseteq X$ is a candidate if $Q = A^{a_1}_{i_1} \cup \cdots \cup A^{a_{k+1}}_{i_{k+1}}$, where indices 
$a_1, \ldots, a_{k+1}$ are distinct, and indices $i_1, \ldots, i_{k+1}$ are distinct. In other words, $Q$ is a candidate 
query if it is the union of $k + 1$ sets from $k + 1$ distinct set systems and $k + 1$ distinct indices 
in those set systems. 

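• Modifications to Section 4.2.4: Similarly, if a candidate $Q = A^{a_1}_{i_1} \cup \cdots \cup A^{a_{k+1}}_{i_{k+1}}$ is not 
a fooling query, then there is some $A \in (\mathcal{I}^1 \cup \cdots \cup \mathcal{I}^{2k+1}) \setminus \{A^{a_1}_{i_1}, \ldots, A^{a_{k+1}}_{i_{k+1}}\}$ with $|A \cap Q| \ge nm^{-\epsilon}/\alpha$. 
Therefore, for one of the components $A^a_b$ of $Q$ we have that $|A \cap A^a_b| \ge nm^{-\epsilon}/k\alpha$. Plugging 
in the approximation ratio $\alpha = m^{\epsilon-\delta}/(k\sqrt{k})$, we have that $|A \cap A^a_b| \ge nm^{-2\epsilon+\delta}\sqrt{k}$. It is not 
too hard to see that we can construct $P$ similarly with 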

1. $|P| \ge k(m-k) = \Omega(km)$ (see footnote 10). 

2. The undirected graph with nodes $\mathcal{I}^1 \cup \cdots \cup \mathcal{I}^{2k+1}$ and edges $P$ is bipartite. Moreover, every node 
in the left part has degree at most 1. Thus $P$ is acyclic. 

3. If $(B,C) \in P$ then $|B \cap C| \ge \Omega(nm^{-2\epsilon+\delta}\sqrt{k})$. 

10 This is not true when $k$ is almost equal to $m$. However, the theorem becomes trivially true when $k > m^{1/6}$, so 
we can without loss of generality assume that $k$ is not too large. 



Continuing with the remaining calculations in this section almost identically gives a bound 
of $2^{-\Omega(knm^{1-2\epsilon+\delta})}$ on the probability of existence of a fixed $P$. The number of such $P$ is at 
most $(km)^{km}$, therefore a similar calculation gives a bound of 

on the existence of any such P. As before, this completes the proof. 

5 Conclusions and Future Work 

This paper introduced and studied a fundamental problem, called SDC, arising in many large-scale 
Web applications. A summary of the results obtained in this paper appears in Table 1 (Section 2.2). 
The main specific open question that arises is whether there is a deterministic oracle that is as good 
as the randomized oracle proposed in Section 3. More generally, a detailed analysis of practical 
subclasses of SDC seems to hold promise. 